Report actual GCP zone in Google Batch trace records #6854

Closed
jonmarti wants to merge 1 commit into master from nf-353-google-batch-report-actual-zone

@jonmarti
Collaborator

Summary

Fixes #6646

The Google Batch API does not expose the actual zone where a task executes; it only provides the configured region (e.g. europe-west2). As a result, trace records report the region rather than the specific zone (e.g. europe-west2-a), even though the zone matters for cost analysis and for debugging placement decisions.

This PR implements the workaround suggested by @pditommaso: querying the GCP instance metadata service from within the task to capture the real zone.

Changes

  • TaskRun.groovy — Added CMD_ZONE = '.command.zone' constant for the zone metadata file
  • GoogleBatchScriptLauncher.groovy — Inject a curl call to the GCP metadata endpoint (http://metadata.google.internal/computeMetadata/v1/instance/zone) in headerScript(), writing the result to .command.zone in the task work directory. The call is silent and non-fatal (2>/dev/null || true)
  • GoogleBatchTaskHandler.groovy — On task completion, read .command.zone, parse the zone name from the metadata format (projects/<id>/zones/<zone>), and update the CloudMachineInfo with the actual zone. Uses a volatile boolean zoneUpdated flag to ensure the file is read at most once (avoiding repeated remote I/O on gcsfuse)
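
As a rough illustration of the capture and parse steps listed above, here is a minimal shell sketch. It is not the PR's diff: the function names `capture_zone` and `parse_zone` and the 5-second timeout are assumptions made for this example, and in the PR the parsing is done in GoogleBatchTaskHandler.groovy rather than in shell. The `Metadata-Flavor: Google` header is required by the GCP metadata server.

```shell
# Hypothetical helper: write the raw metadata response to a file.
# Silent and non-fatal, so a task outside GCP still runs normally.
capture_zone() {
    out_file="$1"
    curl -s --max-time 5 -H 'Metadata-Flavor: Google' \
        'http://metadata.google.internal/computeMetadata/v1/instance/zone' \
        > "$out_file" 2>/dev/null || true
}

# Hypothetical helper: extract <zone> from "projects/<id>/zones/<zone>".
# Prints the zone and returns 0, or returns 1 on malformed input.
parse_zone() {
    raw="$1"
    case "$raw" in
        projects/*/zones/?*) printf '%s\n' "${raw##*/}" ;;
        *) return 1 ;;
    esac
}
```

A caller would typically fall back to the configured region when `parse_zone` returns non-zero, for example `zone="$(parse_zone "$(cat .command.zone 2>/dev/null)")" || zone="$region"`, where `region` is an assumed variable holding the configured value.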

Design decisions

  • Zone capture in headerScript() — This runs at the top of .command.run, which executes for both regular tasks and array task children (each child runs its own .command.run). The parent array launcher (.command.sh) does not capture zone since each child writes its own .command.zone
  • Graceful degradation — If the metadata service is unreachable (e.g. non-GCP environments, Fusion-enabled tasks that bypass GoogleBatchScriptLauncher), the trace record falls back to the configured region as before
  • Read-once semantics — The zoneUpdated flag is set in a finally block, so even if reading fails, we don't retry on every getMachineInfo() call (which is invoked frequently by the polling monitor, Tower observer, and error handlers)
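
The read-once behavior described above can be sketched in shell as follows. This is an analogue under stated assumptions, not the PR's Groovy code: the variable and function names are hypothetical, and setting the flag before attempting the read stands in for setting it in a finally block.

```shell
zone="europe-west2"   # configured region, used as the fallback value
zone_updated=0        # read-once flag (hypothetical name)
zone_reads=0          # counts read attempts, for illustration only

# Hypothetical helper: try to refresh $zone from the given file at most once.
update_zone_once() {
    zone_file="$1"
    if [ "$zone_updated" -eq 1 ]; then
        return 0                      # already attempted, never retry
    fi
    zone_updated=1                    # set first, so a failed read is not retried
    zone_reads=$((zone_reads + 1))
    raw="$(cat "$zone_file" 2>/dev/null)" || return 0
    case "$raw" in
        projects/*/zones/?*) zone="${raw##*/}" ;;   # keep only the <zone> part
    esac
    return 0
}
```

Because the flag is set regardless of the outcome, a missing or malformed file costs one read and every later call returns immediately with the fallback value still in place.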

Test plan

  • GoogleBatchScriptLauncherTest — 2 new tests: zone capture present in header script; zone capture absent from array launch command
  • GoogleBatchTaskHandlerTest — 6 new tests:
    • Zone updated from file on completion
    • Original zone preserved when file is missing
    • Null machineInfo returns null
    • Zone not updated when task is not completed
    • Malformed file content handled gracefully
    • Zone file read only once across multiple getMachineInfo() calls
  • All 76 tests pass (7 launcher + 69 handler), 0 failures

🤖 Generated with Claude Code

The Google Batch API does not expose the actual zone where a task
executes. Query the GCP instance metadata service from within the
task wrapper script to capture the real zone (e.g. europe-west2-a)
into .command.zone, and read it back upon task completion to update
the trace record. Falls back gracefully to the configured region
when the metadata service is unavailable.

Signed-off-by: Jordi Martínez <jmarti@seqera.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jon Marti <jonathan.marti@seqera.io>
@netlify

netlify bot commented Feb 23, 2026

Deploy Preview for nextflow-docs-staging canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | 41d5988 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/nextflow-docs-staging/deploys/699c8bbad862f100077b029e |

@jonmarti requested a review from pditommaso February 23, 2026 17:18
@pditommaso
Member

When did I suggest this?! 😄

@jonmarti
Collaborator Author

> When did I suggest this?! 😄

@pditommaso #6646 (comment) 😅

@jonmarti
Collaborator Author

@pditommaso @munishchouhan I checked #6646 today as part of the resolution of https://seqera.atlassian.net/browse/NF-353 and eventually this ticket in Expedite: https://seqera.atlassian.net/browse/ES-190

@pditommaso
Member

Likely an agent, not me 😄

@pditommaso
Member

I want to explore whether we can avoid using an external file and rely on the API to detect the zone instead. See #6855

@jonmarti
Collaborator Author

Closing in favor of #6855 — Paolo's approach of parsing the zone from StatusEvent descriptions is a better fit:

  • No core module changes (TaskRun)
  • No extra .command.zone file written to GCS per task
  • Works with Fusion-enabled tasks (independent of the script launcher)
  • Leverages data already available from the Batch API — the zones/<zone>/instances/<instance-id> pattern in state transition events is documented and already relied upon in this codebase for spot preemption detection

The only trade-off is one extra getTaskStatus() API call per task, but with the read-once flag it's negligible.
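
For illustration, extracting the zone from an event description that contains the `zones/<zone>/instances/<instance-id>` pattern could look like the shell sketch below. The function name `zone_from_event` and the sample description text in the usage note are invented for this example; the exact wording of Batch StatusEvent descriptions is not shown in this thread, and #6855 implements the parsing in the Groovy task handler, not in shell.

```shell
# Hypothetical helper: print the zone found in an event description,
# or return 1 when no "zones/<zone>/instances/" pattern is present.
zone_from_event() {
    description="$1"
    case "$description" in
        *zones/*/instances/*) ;;
        *) return 1 ;;
    esac
    rest="${description#*zones/}"       # drop everything through the first "zones/"
    candidate="${rest%%/instances/*}"   # keep the text before "/instances/"
    case "$candidate" in
        ''|*[!a-z0-9-]*) return 1 ;;    # reject empty or unexpected characters
    esac
    printf '%s\n' "$candidate"
}
```

With an invented description such as `'VM started on zones/europe-west2-a/instances/1234567890'`, the function prints `europe-west2-a`; when it returns non-zero, the caller would keep the configured region.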

@jonmarti closed this Feb 23, 2026

Development

Successfully merging this pull request may close these issues.

Google Batch: Report actual zone where tasks execute in trace records